Research Toolbox Update: Assessing the Sensitivity of Estimated Coefficients to Endogeneity
In [1]:
import ipystata
from ipystata.config import config_stata
# Raw string so the backslashes in the Windows path are not read as escape sequences
config_stata(r'D:\Program Files (x86)\Stata14\StataSE-64.exe')

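1. Sensitivity analysis of coefficient robustness in a causal-inference framework

Building on a series of papers (Frank [2000]; Pan and Frank [2003]; Frank and Min [2007]; Frank et al. [2013]), two kinds of robustness checks have been developed for estimated coefficients that may be affected by endogeneity.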


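> The first type works within the regression framework: it quantifies the robustness of a causal inference through the correlations that an unobserved variable would need to have with the key explanatory variable and with the outcome.

> The second type uses Rubin's causal model: it asks how much bias there would have to be, expressed as the share of observed cases replaced by counterfactual ones, for the inference to become invalid. The smaller this share, the more fragile the existing inference.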

Type 1

In causal research based on observational data or quasi-natural experiments, a central concern is the inference problem caused by omitted variables.

Whatever the study, there may always be confounding variables that affect both the outcome and the predictor, and these can invalidate the inference.

> An example: for father's occupation (X) and one's own educational attainment (Y), an omitted confounding variable might be the father's education (cv).

Frank (2000) then shows how strongly an omitted confounding variable (cv) would have to be correlated with the predictor (father’s occupation, X) and the outcome (educational attainment, Y ) to invalidate an inference of the effect of X on Y.

[Figure: how a confounding variable (CV) affects the regression coefficient]

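Consider the simplest bivariate regression:

$$Y=\beta_0+\beta_1X+e$$

$$t(\hat{\beta}_1)=\frac{r_{yx}}{\sqrt{\dfrac{1-r_{yx}^2}{n-q-1}}}$$

After controlling for the confounding variable CV, the partial correlation between X and Y is $r_{yx|cv}$:

$$r_{yx|cv}=\frac{r_{yx}-r_{ycv}\,r_{xcv}}{\sqrt{1-r_{ycv}^2}\,\sqrt{1-r_{xcv}^2}}$$

where $n$ is the sample size, $q$ is the number of parameters to be estimated, and $r_{yx}$ is the correlation between y and x.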
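To make the two formulas above concrete, here is a minimal Python sketch that computes the t statistic implied by a correlation and the partial correlation once a confounder is controlled for. The function names and the numbers in the usage lines are illustrative assumptions, not values from the data set used later.

```python
import math

def t_from_r(r_yx, n, q):
    """t statistic of beta_1 implied by the correlation r_yx, sample size n and q parameters."""
    return r_yx / math.sqrt((1.0 - r_yx ** 2) / (n - q - 1))

def partial_r(r_yx, r_ycv, r_xcv):
    """Correlation between y and x after partialling out the confounder cv."""
    return (r_yx - r_ycv * r_xcv) / (
        math.sqrt(1.0 - r_ycv ** 2) * math.sqrt(1.0 - r_xcv ** 2)
    )

# Hypothetical inputs: r_yx = 0.30 and a confounder correlated 0.40 with both y and x.
r_raw = 0.30
r_adj = partial_r(r_raw, 0.40, 0.40)

# Controlling for the confounder shrinks both the correlation and the t statistic.
print(round(r_adj, 4))
print(round(t_from_r(r_raw, 100, 1), 3), round(t_from_r(r_adj, 100, 1), 3))
```

The stronger the confounder's correlations with x and y, the further the partial correlation, and with it the t statistic, falls below the raw value.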

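To overturn the existing inference (that is, to render it no longer significant), consider the case in which the confounder lowers $r_{yx|cv}$ to a threshold $r^{\#}$:

$$r^{\#}=\frac{t_{critical}}{\sqrt{(n-q-1)+t_{critical}^2}}$$

$t_{critical}$ is usually taken to be 1.96; given $t_{critical}$, we can compute $r^{\#}$.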


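Assume $r_{xcv}=r_{ycv}$, that is, CV is equally correlated with y and with x. Then

$$\text{impact}=r_{xcv}\times r_{ycv}=r_{xcv}\times r_{xcv}=r_{ycv}\times r_{ycv}$$

$$r_{yx|cv}=\frac{r_{yx}-r_{ycv}\,r_{xcv}}{\sqrt{1-r_{ycv}^2}\,\sqrt{1-r_{xcv}^2}}=\frac{r_{yx}-\text{impact}}{1-\text{impact}}=r^{\#}$$

Solving for impact:

$$\text{impact}=\frac{r_{yx}-r^{\#}}{1-|r^{\#}|}$$

So, for the earlier inference to be invalidated, the impact of the confounding variable CV must exceed

$$\frac{r_{yx}-r^{\#}}{1-|r^{\#}|}$$

With $r^{\#}$ from above, we can derive the value of impact.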
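The derivation can be checked numerically. The sketch below computes the threshold correlation and the impact threshold, then plugs the impact back into the partial-correlation formula to confirm that it returns the threshold. The function names and inputs are hypothetical and chosen only for illustration.

```python
import math

def r_threshold(t_critical, df):
    """Smallest correlation that is still significant: r# = t / sqrt(df + t^2), df = n - q - 1."""
    return t_critical / math.sqrt(df + t_critical ** 2)

def impact_threshold(r_yx, r_sharp):
    """Impact (r_xcv * r_ycv) at which the partial correlation falls to r#."""
    return (r_yx - r_sharp) / (1.0 - abs(r_sharp))

# Hypothetical inputs: r_yx = 0.30 with n - q - 1 = 98 and t_critical = 1.96.
r_sharp = r_threshold(1.96, 98)
impact = impact_threshold(0.30, r_sharp)

# Under the equal-correlation assumption each correlation is sqrt(impact).
r_cv = math.sqrt(impact)

# Plugging the threshold impact back into the partial correlation recovers r#.
check = (0.30 - impact) / (1.0 - impact)
print(round(r_sharp, 4), round(impact, 4), round(r_cv, 4), round(check, 4))
```

The last two printed values show the correlation a confounder would need with both x and y, and confirm that at this impact the partial correlation sits exactly on the significance threshold.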


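The second type of robustness check for causal inference asks what share of an estimated coefficient would have to be due to bias for the inference to be invalid.

An intuitive way to understand this approach is to compare an estimate with its threshold: how much bias would it take to change or overturn the existing inference?

[Figure: Approach 2]

As the figure above shows, studies A and B share the same threshold, say 4. Study A obtains an estimate of 6 and study B an estimate of 8, so both exceed the threshold of 4. But B's estimate clearly exceeds the threshold by more than A's, assuming the two estimates come from study designs with similar control of selection bias and similar precision.

> We can therefore say that the inference from study B is more robust than the one from study A, because a larger share of B's estimate would have to be bias for its inference to be invalid.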

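More formally, let the population effect be $\delta$, the estimated effect be $\hat{\delta}$, and the threshold for the effect be $\delta^{\#}$.

When the effect is positive, the inference is invalid under the following condition:

$$\hat{\delta}>\delta^{\#}>\delta$$

An inference is invalid if the estimate is greater than the threshold while the population value is less than the threshold.

Subtracting each term from $\hat{\delta}$ rearranges the inequality to:

$$\hat{\delta}-\delta>\hat{\delta}-\delta^{\#}>0$$

Define:

$$\text{bias}(\hat{\delta})=\hat{\delta}-\delta$$

So, to invalidate the inference, the bias must be larger than $\hat{\delta}-\delta^{\#}$, which is the part of the estimated effect that exceeds the threshold:

$$\text{bias}(\hat{\delta})\text{ needed to invalidate the inference}>\hat{\delta}-\delta^{\#}$$

Expressed as a percentage of the estimate:

$$\%\,\text{bias}(\hat{\delta})\text{ needed to invalidate the inference}=\frac{\text{bias}(\hat{\delta})}{\hat{\delta}}>\frac{\hat{\delta}-\delta^{\#}}{\hat{\delta}}=1-\frac{\delta^{\#}}{\hat{\delta}}$$

In the example above, study A gives 1 - 4/6 = 33% and study B gives 1 - 4/8 = 50%. In other words, overturning B's conclusion requires 50% bias, whereas overturning A's requires only 33%.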
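The percentage formula is a one-liner in code. The following sketch applies it to the two studies from the example; the function name is a hypothetical helper introduced for illustration.

```python
def pct_bias_to_invalidate(estimate, threshold):
    """Share of the estimate that must be bias for it to fall to the threshold."""
    return 1.0 - threshold / estimate

# Studies A and B from the example: estimates 6 and 8, common threshold 4.
for name, estimate in [("A", 6.0), ("B", 8.0)]:
    share = pct_bias_to_invalidate(estimate, 4.0)
    print(f"study {name}: {share:.0%}")
# prints "study A: 33%" and "study B: 50%"
```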

In [2]:
%%stata
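* Install konfound and its helper packages (uncomment on first use)
*ssc install konfound
*ssc install indeplist
*ssc install moss
*ssc install matsort 
* Load the Griliches (1976) wage data and run the OLS log-wage regression
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta,clear
reg lw s expr tenure rns smsa i.year iq 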
checking konfound consistency and verifying not already installed...
all files already exist and are up to date.

checking indeplist consistency and verifying not already installed...
all files already exist and are up to date.

checking moss consistency and verifying not already installed...
all files already exist and are up to date.

checking matsort consistency and verifying not already installed...
all files already exist and are up to date.

(Wages of Very Young Men, Zvi Griliches, J.Pol.Ec. 1976)

      Source |       SS           df       MS      Number of obs   =       758
-------------+----------------------------------   F(12, 745)      =     46.86
       Model |  59.9127611        12  4.99273009   Prob > F        =    0.0000
    Residual |  79.3733888       745  .106541461   R-squared       =    0.4301
-------------+----------------------------------   Adj R-squared   =    0.4210
       Total |   139.28615       757  .183997556   Root MSE        =    .32641

------------------------------------------------------------------------------
          lw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           s |   .0619548   .0072786     8.51   0.000     .0476658    .0762438
        expr |   .0308395   .0065101     4.74   0.000     .0180592    .0436198
      tenure |   .0421631   .0074812     5.64   0.000     .0274763    .0568498
         rns |  -.0962935   .0275467    -3.50   0.001    -.1503719   -.0422151
        smsa |   .1328993   .0265758     5.00   0.000     .0807268    .1850717
             |
        year |
         67  |  -.0542095   .0478522    -1.13   0.258    -.1481506    .0397317
         68  |   .0805808   .0448951     1.79   0.073    -.0075551    .1687168
         69  |   .2075915   .0438605     4.73   0.000     .1214867    .2936963
         70  |   .2282237   .0487994     4.68   0.000      .132423    .3240245
         71  |   .2226915   .0430952     5.17   0.000     .1380889     .307294
         73  |   .3228747   .0406574     7.94   0.000     .2430579    .4026915
             |
          iq |   .0027121   .0010314     2.63   0.009     .0006873    .0047369
       _cons |   4.235357   .1133489    37.37   0.000     4.012836    4.457878
------------------------------------------------------------------------------

Suppose we suspect that the estimated coefficient on iq is affected by unobserved variables.

In [3]:
%%stata
konfound iq
------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For iq:
To invalidate the inference 25.34% of the estimate would have to be due to bias
inference 25.34% (192) cases would have to be replaced with cases for which the
------------------
Impact Threshold for Omitted Variable

For iq:
An omitted variable would have to be correlated at 0.161 with the outcome and a
of interest (conditioning on observed covariates) to invalidate an inference. 
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.161 x 0.161=0.0260 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for iq

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .5131 |     .5027 |      .258 |
|         smsa |     .0992 |     .2156 |     .0214 |
|          rns |    -.1339 |    -.1496 |       .02 |
|       tenure |     .0194 |     .1638 |     .0032 |
|         expr |    -.1663 |     .0846 |    -.0141 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .4883 |      .539 |     .2632 |
|          rns |    -.1094 |    -.1059 |     .0116 |
|       tenure |     .0602 |     .1654 |       .01 |
|         smsa |     .0346 |     .1785 |     .0062 |
|         expr |    -.0642 |     .2146 |    -.0138 |
+--------------------------------------------------+

X represents iq, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa

To also compute the confounder's impact threshold and the associated correlations without conditioning on the control variables, use:

In [4]:
%%stata
reg lw s expr tenure rns smsa i.year iq 
konfound iq,uncond(1)
      Source |       SS           df       MS      Number of obs   =       758
-------------+----------------------------------   F(12, 745)      =     46.86
       Model |  59.9127611        12  4.99273009   Prob > F        =    0.0000
    Residual |  79.3733888       745  .106541461   R-squared       =    0.4301
-------------+----------------------------------   Adj R-squared   =    0.4210
       Total |   139.28615       757  .183997556   Root MSE        =    .32641

------------------------------------------------------------------------------
          lw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           s |   .0619548   .0072786     8.51   0.000     .0476658    .0762438
        expr |   .0308395   .0065101     4.74   0.000     .0180592    .0436198
      tenure |   .0421631   .0074812     5.64   0.000     .0274763    .0568498
         rns |  -.0962935   .0275467    -3.50   0.001    -.1503719   -.0422151
        smsa |   .1328993   .0265758     5.00   0.000     .0807268    .1850717
             |
        year |
         67  |  -.0542095   .0478522    -1.13   0.258    -.1481506    .0397317
         68  |   .0805808   .0448951     1.79   0.073    -.0075551    .1687168
         69  |   .2075915   .0438605     4.73   0.000     .1214867    .2936963
         70  |   .2282237   .0487994     4.68   0.000      .132423    .3240245
         71  |   .2226915   .0430952     5.17   0.000     .1380889     .307294
         73  |   .3228747   .0406574     7.94   0.000     .2430579    .4026915
             |
          iq |   .0027121   .0010314     2.63   0.009     .0006873    .0047369
       _cons |   4.235357   .1133489    37.37   0.000     4.012836    4.457878
------------------------------------------------------------------------------

------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For iq:
To invalidate the inference 25.34% of the estimate would have to be due to bias
inference 25.34% (192) cases would have to be replaced with cases for which the
------------------
Impact Threshold for Omitted Variable

For iq:
An omitted variable would have to be correlated at 0.161 with the outcome and a
of interest (conditioning on observed covariates) to invalidate an inference. 
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.161 x 0.161=0.0260 to invalidate an inference.

For iq:
An omitted variable would have to be correlated at 0.122 with the outcome and a
of interest (before conditioning on observed covariates) to invalidate an infer
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.122  x 0.136=0.0167 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for iq

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .5131 |     .5027 |      .258 |
|         smsa |     .0992 |     .2156 |     .0214 |
|          rns |    -.1339 |    -.1496 |       .02 |
|       tenure |     .0194 |     .1638 |     .0032 |
|         expr |    -.1663 |     .0846 |    -.0141 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .4883 |      .539 |     .2632 |
|          rns |    -.1094 |    -.1059 |     .0116 |
|       tenure |     .0602 |     .1654 |       .01 |
|         smsa |     .0346 |     .1785 |     .0062 |
|         expr |    -.0642 |     .2146 |    -.0138 |
+--------------------------------------------------+

X represents iq, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa

Run the analysis for each of the main explanatory variables:

In [5]:
%%stata
qui reg lw s expr tenure rns smsa i.year iq 
konfound iq s expr tenure rns smsa
------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For iq:
To invalidate the inference 25.34% of the estimate would have to be due to bias
inference 25.34% (192) cases would have to be replaced with cases for which the
------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For s:
To invalidate the inference 76.94% of the estimate would have to be due to bias
inference 76.94% (583) cases would have to be replaced with cases for which the
------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For expr:
To invalidate the inference 58.56% of the estimate would have to be due to bias
inference 58.56% (444) cases would have to be replaced with cases for which the
------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For tenure:
To invalidate the inference 65.17% of the estimate would have to be due to bias
inference 65.17% (494) cases would have to be replaced with cases for which the
------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For rns:
To invalidate the inference 43.84% of the estimate would have to be due to bias
inference 43.84% (332) cases would have to be replaced with cases for which the
------------------
The Threshold for % Bias to Invalidate/Sustain the Inference

For smsa:
To invalidate the inference 60.74% of the estimate would have to be due to bias
inference 60.74% (460) cases would have to be replaced with cases for which the
------------------
Impact Threshold for Omitted Variable

For iq:
An omitted variable would have to be correlated at 0.161 with the outcome and a
of interest (conditioning on observed covariates) to invalidate an inference. 
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.161 x 0.161=0.0260 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for iq

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .5131 |     .5027 |      .258 |
|         smsa |     .0992 |     .2156 |     .0214 |
|          rns |    -.1339 |    -.1496 |       .02 |
|       tenure |     .0194 |     .1638 |     .0032 |
|         expr |    -.1663 |     .0846 |    -.0141 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .4883 |      .539 |     .2632 |
|          rns |    -.1094 |    -.1059 |     .0116 |
|       tenure |     .0602 |     .1654 |       .01 |
|         smsa |     .0346 |     .1785 |     .0062 |
|         expr |    -.0642 |     .2146 |    -.0138 |
+--------------------------------------------------+

X represents iq, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa
------------------
Impact Threshold for Omitted Variable

For s:
An omitted variable would have to be correlated at 0.494 with the outcome and a
of interest (conditioning on observed covariates) to invalidate an inference. 
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.494 x 0.494=0.2436 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for s

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|           iq |     .5131 |     .3471 |     .1781 |
|         smsa |     .1021 |     .2156 |      .022 |
|          rns |    -.0648 |    -.1496 |     .0097 |
|       tenure |    -.0496 |     .1638 |    -.0081 |
|         expr |    -.2418 |     .0846 |    -.0205 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|           iq |     .4883 |     .3439 |     .1679 |
|         smsa |     .0596 |     .1833 |     .0109 |
|          rns |     .0095 |    -.0796 |    -.0008 |
|       tenure |    -.0286 |     .1302 |    -.0037 |
|         expr |    -.1725 |     .1258 |    -.0217 |
+--------------------------------------------------+

X represents s, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa
------------------
Impact Threshold for Omitted Variable

For expr:
An omitted variable would have to be correlated at 0.327 with the outcome and a
of interest (conditioning on observed covariates) to invalidate an inference. 
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.327 x 0.327=0.1070 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for expr

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|       tenure |     .2307 |     .1638 |     .0378 |
|          rns |     .0058 |    -.1496 |    -.0009 |
|         smsa |    -.0332 |     .2156 |    -.0071 |
|           iq |    -.1663 |     .3471 |    -.0577 |
|            s |    -.2418 |     .5027 |    -.1216 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|       tenure |      .229 |     .2095 |      .048 |
|          rns |     -.011 |    -.0941 |      .001 |
|         smsa |    -.0161 |     .1681 |    -.0027 |
|           iq |    -.0642 |     .0933 |     -.006 |
|            s |    -.1725 |     .4208 |    -.0726 |
+--------------------------------------------------+

X represents expr, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa
------------------
Impact Threshold for Omitted Variable

For tenure:
An omitted variable would have to be correlated at 0.375 with the outcome and a
of interest (conditioning on observed covariates) to invalidate an inference. 
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.375 x 0.375=0.1407 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for tenure

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|         expr |     .2307 |     .0846 |     .0195 |
|         smsa |     .0331 |     .2156 |     .0071 |
|           iq |     .0194 |     .3471 |     .0067 |
|          rns |    -.0366 |    -.1496 |     .0055 |
|            s |    -.0496 |     .5027 |    -.0249 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|         expr |      .229 |     .2593 |     .0594 |
|           iq |     .0602 |     .1181 |     .0071 |
|         smsa |     .0338 |     .1788 |      .006 |
|          rns |    -.0257 |    -.0969 |     .0025 |
|            s |    -.0286 |     .4451 |    -.0127 |
+--------------------------------------------------+

X represents tenure, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa
------------------
Impact Threshold for Omitted Variable

For rns:
An omitted variable would have to be correlated at 0.244 with the outcome and a
of interest (conditioning on observed covariates. signs are interchangeable) to
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
-.244 x 0.244=-0.0596 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for rns

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|         expr |     .0058 |     .0846 |     .0005 |
|       tenure |    -.0366 |     .1638 |     -.006 |
|            s |    -.0648 |     .5027 |    -.0326 |
|         smsa |    -.1611 |     .2156 |    -.0347 |
|           iq |    -.1339 |     .3471 |    -.0465 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .0095 |     .4506 |     .0043 |
|         expr |     -.011 |      .222 |    -.0024 |
|       tenure |    -.0257 |     .1614 |    -.0042 |
|           iq |    -.1094 |     .1201 |    -.0131 |
|         smsa |    -.1496 |     .1904 |    -.0285 |
+--------------------------------------------------+

X represents rns, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa
------------------
Impact Threshold for Omitted Variable

For smsa:
An omitted variable would have to be correlated at 0.342 with the outcome and a
of interest (conditioning on observed covariates) to invalidate an inference. 
Correspondingly the impact of an omitted variable (as defined in Frank 2000) mu
0.342 x 0.342=0.1169 to invalidate an inference.

These thresholds can be compared with the impacts of observed covariates below.

Observed Impact Table for smsa

+--------------------------------------------------+
|          Raw |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .1021 |     .5027 |     .0513 |
|           iq |     .0992 |     .3471 |     .0344 |
|          rns |    -.1611 |    -.1496 |     .0241 |
|       tenure |     .0331 |     .1638 |     .0054 |
|         expr |    -.0332 |     .0846 |    -.0028 |
+--------------------------------------------------+

+--------------------------------------------------+
|      Partial |    Cor(v, |    Cor(v, |           |
|--------------+-----------+-----------+-----------|
|            s |     .0596 |     .4553 |     .0271 |
|          rns |    -.1496 |    -.1197 |     .0179 |
|       tenure |     .0338 |     .1631 |     .0055 |
|           iq |     .0346 |     .1142 |      .004 |
|         expr |    -.0161 |     .2161 |    -.0035 |
+--------------------------------------------------+

X represents smsa, Y represents lw, v represents each covariate.
First table is based on unconditional correlations, second table is based on pa

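A brief interpretation of the results:

1. The naive OLS estimates indicate that s, expr, tenure, rns, smsa and iq all have a significant effect on lw (log wage).

2. The checks show that, for the OLS estimates of s, expr, tenure, rns, smsa and iq to become unreliable, the required bias is 76.94%, 58.56%, 65.17%, 43.84%, 60.74% and 25.34%, respectively. iq has by far the lowest threshold, so it may be seriously affected by confounding variables and deserves our attention.

3. The figure below tells the same story: in terms of the red-to-blue ratio, iq is again the smallest.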
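The ranking behind this reading can be made explicit with a few lines of Python. The percentages are copied from the konfound output above; the variable names are illustrative, and the case counts are simply each percentage applied to the 758 observations.

```python
# Percent bias to invalidate, copied from the konfound output above.
pct_bias = {"s": 76.94, "expr": 58.56, "tenure": 65.17,
            "rns": 43.84, "smsa": 60.74, "iq": 25.34}
n_obs = 758

# Sort from most fragile (smallest required bias) to most robust.
ranked = sorted(pct_bias.items(), key=lambda kv: kv[1])
for name, pct in ranked:
    print(f"{name:>6}: {pct:5.2f}%  (~{pct / 100 * n_obs:.0f} cases)")

most_fragile = ranked[0][0]
```

Sorting this way puts iq first and s last, which matches the interpretation that iq is the coefficient most exposed to omitted-variable bias.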


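The simple conclusion is that the possible endogeneity of iq needs further treatment. One option is a simple IV estimate; here we directly use the example that ships with the ivreg2 command.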

In [13]:
%%stata
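* Uncomment to install the table-export and IV commands used below
*ssc install outreg2
*ssc install estout
*ssc install ivreg2
* OLS benchmark
qui reg lw s expr tenure rns smsa i.year iq
estadd local control 'yes'
est store m1
* IV: instrument iq with med, kww, age and mrt
qui ivreg2 lw s expr tenure rns smsa i.year (iq=med kww age mrt)
estadd local control 'yes'
est store m2
* Side-by-side table showing only the iq coefficient
esttab m1 m2 ,mtitle('ols' 'iv')  keep(iq) b(%6.3f) nogap compress  star(* 0.1 ** 0.05 *** 0.01) scalar(N control)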
added macro:
            e(control) : "'yes'"

added macro:
            e(control) : "'yes'"

------------------------------------
                 (1)          (2)   
               'ols'         'iv'   
------------------------------------
iq             0.003***     0.000   
              (2.63)       (0.04)   
------------------------------------
N                758          758   
control        'yes'        'yes'   
------------------------------------
t statistics in parentheses
* p<0.1, ** p<0.05, *** p<0.01